One of the most fundamental questions in quantitative finance is the existence of continuous-time diffusion models that fit the market prices of a given set of options. Traditionally, a combination of intuition, theoretical, and empirical analysis is used to find models that achieve exact or approximate fits. Our contribution is to show how a suitable game-theoretical formulation of this problem can help solve it, by leveraging existing developments in modern deep multi-agent reinforcement learning to search in the space of stochastic processes. More importantly, we hope the community can leverage and extend our techniques to solve important problems in that field, such as the SPX-VIX calibration problem. Our experiments show that we are able to learn local volatility, as well as the path-dependence required in the volatility process, to minimize the price of a Bermudan option. In one sentence, our algorithm can be seen as a particle method à la Guyon and Henry-Labordère where the particles, instead of being designed to ensure $\sigma_{loc}(t, S_t)^2 = \mathbb{E}[\sigma_t^2 | S_t]$, are learning RL-driven agents cooperating towards more general calibration targets. This is among the first works bridging reinforcement learning with the derivative calibration problem.
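For context, the particle condition $\sigma_{loc}(t, S_t)^2 = \mathbb{E}[\sigma_t^2 | S_t]$ is typically enforced by estimating the conditional expectation across simulated particles. A minimal sketch of such an estimate via Gaussian-kernel regression follows; the bandwidth, particle grid, and variance values are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def cond_var_estimate(S, sigma2, s, h=0.1):
    """Nadaraya-Watson kernel estimate of E[sigma_t^2 | S_t = s] from
    simulated particles (S_i, sigma_i^2), using a Gaussian kernel of
    bandwidth h (illustrative choice)."""
    w = np.exp(-0.5 * ((S - s) / h) ** 2)
    return np.sum(w * sigma2) / np.sum(w)

# Sanity check: with flat particle variances the estimate recovers them.
S = np.linspace(0.8, 1.2, 101)          # particle spot values
sigma2 = np.full_like(S, 0.04)          # constant instantaneous variance
est = cond_var_estimate(S, sigma2, 1.0)
```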
We present a new financial framework in which two categories of RL-based agents, representing liquidity providers and liquidity takers, learn simultaneously to satisfy their objectives. Thanks to a parametrized reward formulation and the use of deep RL, each group learns a shared policy able to generalize and interpolate over a wide range of behaviors. This is a step towards a fully RL-based market simulator replicating complex market conditions, particularly suitable for studying the dynamics of financial markets under various scenarios.
Cheung and Piliouras (2020) recently showed that two variants of the Multiplicative Weights Update method - OMWU and MWU - display opposite convergence properties depending on whether the game is zero-sum or cooperative. Inspired by this work and by the recent literature on learning to optimize single functions, we introduce a new framework for learning last-iterate convergence to Nash equilibria in games, where the coefficients (learning rates) of the update rule along a trajectory are learnt by a reinforcement learning policy conditioned on the nature of the game: the \textit{game signature}. We construct the latter using a new decomposition of two-player games into eight components corresponding to commutative projection operators, generalizing and unifying recent game concepts studied in the literature. We compare the performance of various update rules when their coefficients are learnt, and show that the RL policy is able to exploit the game signature across a wide range of game types. In doing so, we introduce CMWU, a new algorithm that extends consensus optimization to the constrained case, has local convergence guarantees for zero-sum bimatrix games, and show that it enjoys competitive performance both on zero-sum games with constant coefficients and across a spectrum of games when its coefficients are learnt.
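For reference, the plain MWU step that both variants build on can be sketched as follows; `eta` is the learning-rate coefficient that the framework above proposes to learn with an RL policy, and the values here are purely illustrative:

```python
import numpy as np

def mwu_step(weights, payoffs, eta=0.1):
    """One Multiplicative Weights Update: reweight each strategy by the
    exponentiated payoff, then renormalize back to a distribution."""
    new_w = weights * np.exp(eta * payoffs)
    return new_w / new_w.sum()

# Two-strategy example: the higher-payoff strategy gains probability mass.
w = mwu_step(np.array([0.5, 0.5]), np.array([1.0, 0.0]))
```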
Policy gradient methods can solve complex tasks but often fail when the dimensionality of the action space or the objective multiplicity grows very large. This occurs, in part, because the variance of score-based gradient estimators scales quadratically. In this paper, we address this problem through a factored baseline that exploits the independence structure encoded in a novel action-target influence network. Factored policy gradients (FPGs), which follow, provide a common framework for analyzing key state-of-the-art algorithms, are shown to generalize traditional policy gradients, and yield a principled way of incorporating prior knowledge of a problem domain's generative processes. We provide an analysis of the proposed estimator and identify the conditions under which variance is reduced. The algorithmic aspects of FPGs are discussed, including optimal policy factorization, as characterized by minimum biclique coverings, and the bias-variance implications of a misspecified network. Finally, we demonstrate the performance advantages of our algorithm on large-scale bandit and traffic-intersection problems, providing a novel contribution to the latter in the form of a spatial approximation.
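The variance-reduction mechanism that factored baselines build on can be seen in a toy score-function (REINFORCE-style) estimate for a 1-D Gaussian policy; this only demonstrates the generic effect of subtracting a baseline, not the factored estimator itself, and all values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_samples(baseline, n=20000, mu=0.0):
    """Per-sample score-function gradient estimates d/dmu E[r(a)] for
    actions a ~ N(mu, 1) and reward r(a) = a + 5 (large constant offset)."""
    a = rng.normal(mu, 1.0, size=n)
    r = a + 5.0
    score = a - mu                 # d/dmu log N(a; mu, 1)
    return score * (r - baseline)

var_no_b = grad_samples(0.0).var()     # no baseline
var_with_b = grad_samples(5.0).var()   # subtract the mean reward
```

Both estimators are unbiased, but removing the constant offset from the reward sharply cuts the variance, which is the effect a well-chosen (here, factored) baseline exploits at scale.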
As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user's preferred answer ("sycophancy") and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. Overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors.
As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.
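The supervised critique-and-revision phase described above can be sketched as a loop; `generate` here is a hypothetical stand-in stub for sampling from a language model, since the real method calls an LM at every step:

```python
def generate(prompt):
    # Hypothetical stub standing in for a language-model call.
    return f"<response to: {prompt}>"

def critique_revise(query, principles):
    """Sample an initial response, then repeatedly ask the model to
    critique it against a constitutional principle and revise it; the
    final revision becomes a finetuning target."""
    response = generate(query)
    for principle in principles:
        critique = generate(f"Critique this per '{principle}': {response}")
        response = generate(f"Revise given critique: {critique}\n{response}")
    return response
```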
In this work, we demonstrate the offline FPGA realization of both recurrent and feedforward neural network (NN)-based equalizers for nonlinearity compensation in coherent optical transmission systems. First, we present a realization pipeline showing the conversion of the models from Python libraries to the FPGA chip synthesis and implementation. Then, we review the main alternatives for the hardware implementation of nonlinear activation functions. The main results are divided into three parts: a performance comparison, an analysis of how activation functions are implemented, and a report on the complexity of the hardware. The performance in Q-factor is presented for the cases of bidirectional long-short-term memory coupled with convolutional NN (biLSTM + CNN) equalizer, CNN equalizer, and standard 1-StpS digital back-propagation (DBP) for the simulation and experiment propagation of a single channel dual-polarization (SC-DP) 16QAM at 34 GBd along 17x70km of LEAF. The biLSTM+CNN equalizer provides a similar result to DBP and a 1.7 dB Q-factor gain compared with the chromatic dispersion compensation baseline in the experimental dataset. After that, we assess the Q-factor and the impact of hardware utilization when approximating the activation functions of NN using Taylor series, piecewise linear, and look-up table (LUT) approximations. We also show how to mitigate the approximation errors with extra training and provide some insights into possible gradient problems in the LUT approximation. Finally, to evaluate the complexity of hardware implementation to achieve 400G throughput, fixed-point NN-based equalizers with approximated activation functions are developed and implemented in an FPGA.
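The LUT activation approximation evaluated above amounts to precomputing the nonlinearity on a fixed grid and replacing each call with a table read. A minimal sketch for `tanh`; the grid range and table size here are illustrative choices, not the paper's values:

```python
import numpy as np

LO, HI, N = -4.0, 4.0, 256              # illustrative range and table size
TABLE = np.tanh(np.linspace(LO, HI, N))  # precomputed activation values

def tanh_lut(x):
    """Approximate tanh(x) by the nearest table entry; inputs outside
    [LO, HI] saturate, as a fixed-point hardware LUT would."""
    idx = np.rint((np.clip(x, LO, HI) - LO) / (HI - LO) * (N - 1)).astype(int)
    return TABLE[idx]
```

With this grid the worst-case step error is about half the grid spacing times the maximum slope of `tanh`, i.e. roughly 0.016, which is the kind of approximation error the extra-training mitigation above targets.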
To circumvent the non-parallelizability of recurrent neural network-based equalizers, we propose knowledge distillation to recast the RNN into a parallelizable feedforward structure. The latter shows a 38% latency decrease, while impacting the Q-factor by only 0.5 dB.
The problem of learning threshold functions is a fundamental one in machine learning. Classical learning theory implies sample complexity of $O(\xi^{-1} \log(1/\beta))$ (for generalization error $\xi$ with confidence $1-\beta$). The private version of the problem, however, is more challenging and in particular, the sample complexity must depend on the size $|X|$ of the domain. Progress on quantifying this dependence, via lower and upper bounds, was made in a line of works over the past decade. In this paper, we finally close the gap for approximate-DP and provide a nearly tight upper bound of $\tilde{O}(\log^* |X|)$, which matches a lower bound by Alon et al (that applies even with improper learning) and improves over a prior upper bound of $\tilde{O}((\log^* |X|)^{1.5})$ by Kaplan et al. We also provide matching upper and lower bounds of $\tilde{\Theta}(2^{\log^*|X|})$ for the additive error of private quasi-concave optimization (a related and more general problem). Our improvement is achieved via the novel Reorder-Slice-Compute paradigm for private data analysis which we believe will have further applications.
Developing safe and useful general-purpose AI systems will require us to make progress on scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at hand. Empirical work on this problem is not straightforward, since we do not yet have systems that broadly exceed our abilities. This paper discusses one of the major ways we think about this problem, with a focus on how to turn it into one that can be productively studied empirically. We first present an experimental design centered on choosing tasks for which human specialists succeed but unaided humans and current general AI systems fail. We then present a proof-of-concept experiment meant to demonstrate a key feature of this experimental design and show its viability with two question-answering tasks: MMLU and time-limited QuALITY. On these tasks, we find that human participants who interact with an unreliable large-language-model dialog assistant through chat -- a trivial baseline strategy for scalable oversight -- substantially outperform both the model alone and their own unaided performance. These results are an encouraging sign that scalable oversight will be tractable to study with present models and bolster recent findings that large language models can productively assist humans with difficult tasks.